Abstract. We present an exact solution for the dynamics of on-line Hebbian learning 
in neural networks, with restricted and unrealizable training sets. In contrast to 
other studies on learning with restricted training sets, unrealizability is here caused by 
structural mismatch, rather than data noise: the teacher machine is a perceptron with 
a reversed-wedge-type transfer function, while the student machine is a perceptron with 
a sigmoidal transfer function. We calculate the glassy dynamics of the macroscopic 
performance measures, training error and generalization error, and the (non-Gaussian) 
student field distribution. Our results, which find excellent confirmation in numerical 
simulations, provide a new benchmark test for general formalisms with which to study 
unrealizable learning processes with restricted training sets. 



PACS numbers: 87.10.+e 

On-line learning processes in artificial neural networks have been studied using statistical 
mechanical techniques for about a decade now (see e.g. [1], [2] for reviews). Initially, most 
dynamical studies were restricted to the regime where the number of training examples 
is larger than the number of learning steps, since this generally leads to Gaussian 
field distributions and relatively simple non-glassy dynamics. In practical situations, 
however, it is usually difficult to acquire large training sets, and one is therefore often 
forced to recycle the data in the training set. The latter situation, characterized by the 
presence of disorder (the composition of the training set) and non-trivial dynamics, was 
studied in e.g. [3], [4] (for binary weights), and in [5], [6], [7], [8], [9], [10] (for continuous weights). 



These studies generally involve approximations at some stage. This motivated [11], 
where it was shown how for the special case of on-line Hebbian learning the dynamics 
can be solved exactly (for restricted training sets), providing an excellent benchmark 
for general theories and approximation schemes. Some of the studies mentioned above 
involved learning from restricted but unrealizable training sets, where it is impossible 
for the student to achieve perfect performance, even if an infinitely large training set had 
been available. This could result from corruption by noise of realizable data (as in e.g. 
H |TTJ ) , or from structural mismatch between teacher and student. A typical toy model 
to realize the latter situation is obtained by using a perceptron with a reversed wedge 
transfer function as a teacher machine to train an ordinary perceptron [12], [13], [14] (note: 
there is also a relation between simple perceptrons with reversed wedge transfer functions 
and the so-called parity machines). Since all dynamical studies with restricted but 
unrealizable training sets have so far been carried out only for the data noise scenario, 
it would be of considerable interest to investigate exactly solvable models with restricted 
training sets but unrealizability due to structural mismatch. In this letter, we carry out 
such a study: we solve the dynamics of on-line Hebbian learning from unrealizable 
restricted training sets, for a teacher-student scenario where teacher and student have 
different transfer functions (a reversed-wedge and a sigmoidal one, respectively). 

We investigate on-line learning in an ordinary student perceptron $S$ (whose weight 
vector is denoted by $\mathbf{J}$), which tries to learn a task defined by a teacher perceptron $T_a$ 
(whose weight vector is denoted by $\mathbf{B}$). The teacher is equipped with a reversed wedge 
transfer function, i.e. $T_a(y) = \mathrm{sgn}[y(a-y)(a+y)]$ where $y = \mathbf{B}\cdot\boldsymbol{\xi}$ and $\boldsymbol{\xi} \in \{-1,1\}^N$ 
is the input vector, whereas $S(x) = \mathrm{sgn}[x]$ with $x = \mathbf{J}\cdot\boldsymbol{\xi}$. The teacher's weight vector 
$\mathbf{B}$ is normalized such that $\mathbf{B}^2 = 1$, with $B_i = O(N^{-1/2})$ for each $i$. It is clear that in 
the limits $a \to 0$ and $a \to \infty$ (where $a$ characterizes the width of the reversed wedge) the 
task becomes realizable for the student, since $T_0(y) = \mathrm{sgn}[-y]$ and $T_\infty(y) = \mathrm{sgn}[y]$. 
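The two transfer functions can be written down in a few lines. The following is a minimal sketch, not the authors' code; the function names and the convention sgn[0] = +1 are assumptions made here for illustration.

```python
import math


def teacher_output(y: float, a: float) -> int:
    """Reversed-wedge teacher T_a(y) = sgn[y (a - y)(a + y)].

    Assumption: sgn[0] is taken to be +1.
    """
    if math.isinf(a):
        # a -> infinity: the wedge never flips the sign, T(y) = sgn[y]
        return 1 if y >= 0 else -1
    value = y * (a - y) * (a + y)
    return 1 if value >= 0 else -1


def student_output(x: float) -> int:
    """Ordinary perceptron student S(x) = sgn[x] (same sgn[0] convention)."""
    return 1 if x >= 0 else -1


# Inside the wedge (|y| < a) the teacher agrees with sgn[y],
# outside (|y| > a) it gives the opposite sign.
print(teacher_output(0.5, 1.0), teacher_output(1.5, 1.0))
```

For a = 0 the teacher output is the opposite of sgn[y] for every nonzero field, and for infinite a it coincides with sgn[y], which are the two realizable limits mentioned above.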

We define the conventional order parameters $Q[\mathbf{J}] = \mathbf{J}^2$ and $R[\mathbf{J}] = \mathbf{B}\cdot\mathbf{J}$. 
One of the main quantities of interest is the generalization error $E_g$, the probability of 
disagreement between teacher and student for input vectors taken randomly from the 
full set of all possible inputs: 

$$E_g = \langle \theta[-T_a(y)S(x)] \rangle_{\boldsymbol{\xi}} = \int_0^a Du\, \mathrm{erf}[r_+(u)] + \int_a^\infty Du\, \mathrm{erf}[r_-(u)], \qquad (1)$$

where $r_\pm(u) \equiv \pm Ru/\sqrt{2(Q-R^2)}$, $Du \equiv (2\pi)^{-1/2}\,du\,e^{-u^2/2}$, $\theta[\cdots]$ is the step function, 
and $\langle\cdots\rangle_{\boldsymbol{\xi}}$ denotes averaging over all $\boldsymbol{\xi} \in \{-1,1\}^N$. It was shown in [13] that the 
optimal normalized overlap $r = R/\sqrt{Q}$ (giving the smallest value of the generalization 
error) equals 1 as long as the reversed wedge parameter $a$ is greater than $a_{c1} = 0.8$; 
$r$ suddenly changes from 1 to $r^* = -\sqrt{(2\log 2 - a^2)/(2\log 2)}$ at $a = a_{c1}$. 
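Equation (1) is a one-dimensional Gaussian integral and is easy to evaluate numerically. The sketch below is an illustration only. It assumes that the function written erf in (1) acts as the standard complementary error function, which is what a direct integration over the joint Gaussian distribution of $(x, y)$ gives and which yields $E_g = 1/2$ at $R = 0$; the normalization convention used in the text cannot be confirmed from this copy. The helper names and the integration cutoff are choices made here.

```python
import math


def _gauss(u: float) -> float:
    """Gaussian measure density, Du = _gauss(u) du."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def _trapezoid(f, lo: float, hi: float, n: int = 4000) -> float:
    """Plain trapezoidal rule on [lo, hi] with n sub-intervals."""
    h = (hi - lo) / n
    total = 0.5 * (f(lo) + f(hi))
    for k in range(1, n):
        total += f(lo + k * h)
    return total * h


def generalization_error(R: float, Q: float, a: float, cutoff: float = 12.0) -> float:
    """Numerical value of equation (1); requires Q > R**2.

    Assumption: 'erf' in (1) is evaluated as math.erfc (see lead-in).
    The infinite upper limit is replaced by `cutoff`.
    """
    scale = math.sqrt(2.0 * (Q - R * R))
    split = min(a, cutoff)
    inner = _trapezoid(lambda u: _gauss(u) * math.erfc(R * u / scale), 0.0, split)
    outer = _trapezoid(lambda u: _gauss(u) * math.erfc(-R * u / scale), split, cutoff)
    return inner + outer


print(generalization_error(0.5, 1.0, 20.0))  # close to arccos(0.5)/pi for a very wide wedge
```

For a very wide wedge the result reduces to the familiar perceptron value $\arccos(R/\sqrt{Q})/\pi$, and for $a = 0$ to one minus that value, which gives a quick check on the implementation.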

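For this model system, we use the following on-line Hebbian learning rule 

$$\mathbf{J}(\ell+1) = \left(1 - \frac{\gamma}{N}\right)\mathbf{J}(\ell) + \frac{\eta}{N}\, T_a^{\mu(\ell)}\, \boldsymbol{\xi}^{\mu(\ell)} \qquad (2)$$

where $\ell$ indicates the learning step, and $\eta$ and $\gamma$ represent the learning rate and the 
weight decay, respectively. The student learns from data picked randomly from the 
restricted training set $D = \{(\boldsymbol{\xi}^\mu, T_a^\mu),\ \mu = 1,\ldots,p = N\alpha\}$, with $T_a^\mu = T_a(\mathbf{B}\cdot\boldsymbol{\xi}^\mu)$. 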

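To calculate macroscopic physical observables, averaged over the disorder (the 
composition of the training set) at any time, we need to distinguish between two 
averaging procedures. The first is the average over all possible 'paths' 
$\Omega = \{\mu(0), \mu(1), \ldots, \mu(\ell), \ldots\}$ defining the actual sampling order from the training set: 

$$\langle f(\boldsymbol{\xi}^{\mu(\ell)}, T_a^{\mu(\ell)}) \rangle_\Omega = \frac{1}{p}\sum_{\mu=1}^{p} f(\boldsymbol{\xi}^\mu, T_a^\mu) \qquad (3)$$

The second is averaging over all training sets: 

$$\langle f[(\boldsymbol{\xi}^1,T_a^1),\ldots,(\boldsymbol{\xi}^p,T_a^p)] \rangle_{\mathrm{sets}} = 2^{-Np}\sum_{\boldsymbol{\xi}^1}\cdots\sum_{\boldsymbol{\xi}^p} f[(\boldsymbol{\xi}^1,T_a^1),\ldots,(\boldsymbol{\xi}^p,T_a^p)] \qquad (4)$$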

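The key to the full solvability of the present model, as in [11], is the fact that (2) allows 
us to write $\mathbf{J}(m)$ (the student's weight vector at the $m$-th step) in explicit form as 

$$\mathbf{J}(m) = \sigma^m \mathbf{J}(0) + \frac{\eta}{N}\sum_{\ell=0}^{m-1}\sigma^{m-\ell-1}\, T_a^{\mu(\ell)}\,\boldsymbol{\xi}^{\mu(\ell)} \qquad (5)$$

where $\sigma = 1 - \gamma/N$. The above averaging procedures can now be carried out exactly. 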
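That the explicit form follows from iterating the learning rule can be checked numerically for any fixed sampling path. The sketch below iterates the rule step by step and compares the result with the explicit sum; it is an illustration under assumed parameter values (N = 50, p = 25, 150 steps), and all function and variable names are hypothetical.

```python
import numpy as np


def teacher(y: float, a: float) -> float:
    """Reversed-wedge output sgn[y (a - y)(a + y)], with sgn[0] taken as +1."""
    return 1.0 if y * (a - y) * (a + y) >= 0 else -1.0


def iterate_rule(J0, B, patterns, path, eta, gamma, a):
    """Apply the on-line Hebbian rule once per entry of `path`."""
    N = J0.size
    sigma = 1.0 - gamma / N
    J = J0.copy()
    for mu in path:
        xi = patterns[mu]
        J = sigma * J + (eta / N) * teacher(B @ xi, a) * xi
    return J


def explicit_form(J0, B, patterns, path, eta, gamma, a):
    """Evaluate the explicit sum over past steps for the same path."""
    N = J0.size
    sigma = 1.0 - gamma / N
    m = len(path)
    J = sigma**m * J0
    for ell, mu in enumerate(path):
        xi = patterns[mu]
        J = J + (eta / N) * sigma ** (m - ell - 1) * teacher(B @ xi, a) * xi
    return J


# Small demonstration with arbitrary (assumed) parameter values.
rng = np.random.default_rng(0)
N, p = 50, 25
patterns = rng.choice([-1.0, 1.0], size=(p, N))
B = rng.standard_normal(N)
B /= np.linalg.norm(B)
J0 = rng.standard_normal(N)
path = rng.integers(0, p, size=3 * N)

J_iter = iterate_rule(J0, B, patterns, path, eta=1.0, gamma=0.5, a=1.0)
J_sum = explicit_form(J0, B, patterns, path, eta=1.0, gamma=0.5, a=1.0)
print(np.max(np.abs(J_iter - J_sum)))
```

Because the update is linear in the weight vector, the two results agree up to floating-point rounding for any path and any training set.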

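In order to evaluate the training time dependence of the generalization error (1), 
following [11], we first calculate the following two macroscopic observables 

$$Q(t) = \lim_{N\to\infty} \langle\langle Q[\mathbf{J}(m)]\rangle_\Omega\rangle_{\mathrm{sets}}, \qquad R(t) = \lim_{N\to\infty} \langle\langle R[\mathbf{J}(m)]\rangle_\Omega\rangle_{\mathrm{sets}}, \qquad (6)$$

where $t = m/N$. Squaring (5) gives 

$$\langle\langle Q[\mathbf{J}(m)]\rangle_\Omega\rangle_{\mathrm{sets}} = \sigma^{2Nt} Q_0 + \frac{2\eta}{N}\,\sigma^{Nt}\sum_{\ell=0}^{m-1}\sigma^{m-\ell-1}\langle\langle (\mathbf{J}_0\cdot\boldsymbol{\xi}^{\mu(\ell)})\,T_a^{\mu(\ell)}\rangle_\Omega\rangle_{\mathrm{sets}} + \frac{\eta^2}{N^2}\sum_{\ell,\ell'=0}^{m-1}\sigma^{2m-\ell-\ell'-2}\langle\langle (\boldsymbol{\xi}^{\mu(\ell)}\cdot\boldsymbol{\xi}^{\mu(\ell')})\,T_a^{\mu(\ell)}T_a^{\mu(\ell')}\rangle_\Omega\rangle_{\mathrm{sets}}.$$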



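After calculating the averages $\langle\cdots\rangle_\Omega$ and $\langle\cdots\rangle_{\mathrm{sets}}$, and taking $N\to\infty$, we then obtain 

$$Q(t) = e^{-2\gamma t}Q_0 + \frac{2\eta\rho_a R_0}{\gamma}\,e^{-\gamma t}(1-e^{-\gamma t}) + \frac{\eta^2}{2\gamma}(1-e^{-2\gamma t}) + \frac{\eta^2}{\gamma^2}\left(\frac{1}{\alpha}+\rho_a^2\right)(1-e^{-\gamma t})^2 \qquad (7)$$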

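where we defined $\rho_a = \langle(\mathbf{B}\cdot\boldsymbol{\xi})\,T_a(\mathbf{B}\cdot\boldsymbol{\xi})\rangle_{\boldsymbol{\xi}} = \sqrt{2/\pi}\,(1 - 2e^{-a^2/2})$. The quantity $\rho_a$ 
represents a kind of effective noise induced by the reversed wedge of the teacher. In a 
similar manner we obtain an exact expression for the student-teacher overlap $R(t)$: 

$$R(t) = e^{-\gamma t}R_0 + \frac{\eta\rho_a}{\gamma}(1-e^{-\gamma t}). \qquad (8)$$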
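The closed forms (7) and (8) are straightforward to evaluate. The following sketch codes them as read here, with $\rho_a = \sqrt{2/\pi}\,(1 - 2e^{-a^2/2})$; the function names and the argument order are choices made for this illustration, and $\gamma > 0$ is assumed.

```python
import math


def rho(a: float) -> float:
    """rho_a = sqrt(2/pi) * (1 - 2 exp(-a^2 / 2))."""
    return math.sqrt(2.0 / math.pi) * (1.0 - 2.0 * math.exp(-0.5 * a * a))


def overlap_R(t, R0, eta, gamma, a):
    """Student-teacher overlap R(t), equation (8); requires gamma > 0."""
    decay = math.exp(-gamma * t)
    return decay * R0 + eta * rho(a) / gamma * (1.0 - decay)


def length_Q(t, Q0, R0, eta, gamma, alpha, a):
    """Squared weight length Q(t), equation (7); requires gamma > 0."""
    decay = math.exp(-gamma * t)
    r = rho(a)
    return (decay**2 * Q0
            + 2.0 * eta * r * R0 / gamma * decay * (1.0 - decay)
            + eta**2 / (2.0 * gamma) * (1.0 - decay**2)
            + eta**2 / gamma**2 * (1.0 / alpha + r * r) * (1.0 - decay) ** 2)


# Q - R^2 evaluated for two different wedge widths at the same time t.
for a in (0.3, 2.0):
    Q = length_Q(1.5, 1.0, 0.2, 1.0, 0.5, 2.0, a)
    R = overlap_R(1.5, 0.2, 1.0, 0.5, a)
    print(a, Q - R * R)
```

The loop prints the same value of $Q - R^2$ for both wedge widths, since every $\rho_a$-dependent term cancels in that combination.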



The length of the component of $\mathbf{J}$ which is orthogonal to $\mathbf{B}$, $\sqrt{Q-R^2}$, is seen to remain 
independent of $\rho_a$. This is easily understood. The components of the input vectors which 
are orthogonal to $\mathbf{B}$ are uncorrelated with the training outputs, so their evolution is 
not modified by the effect of the reversed wedge. From (7) and (8) in turn, we immediately 
obtain the generalization error at any time, via (1). For $t\to\infty$ this becomes 

$$\lim_{t\to\infty}E_g = \int_0^a Du\,\mathrm{erf}[r_+(u)] + \int_a^\infty Du\,\mathrm{erf}[r_-(u)] \qquad (9)$$

with $r_\pm(u) \equiv \pm\rho_a u/\sqrt{\gamma + 2/\alpha}$. In Figure 1, we show the asymptotic value of $E_g$ for 
$\alpha\to\infty$ (where we recover the unrestricted training sets behaviour), for different $\gamma$. 
We see that for $\gamma = 0$, $E_g$ converges to $2\,\mathrm{erf}(a/\sqrt{2})$ for $a < a_{c2} = \sqrt{2\log 2}$ and to 
$1 - 2\,\mathrm{erf}(a/\sqrt{2})$ for $a > a_{c2}$, with an asymptotic scaling form $E_g \sim \alpha^{-1/2}$ as $\alpha\to\infty$ 
[13]. On the other hand, for $\gamma > 0$, $E_g$ converges to $E_g|_{\alpha=\infty}$ as $E_g \sim \alpha^{-1}$. For 
$\gamma = 0$ and $a < a_{c2}$, the asymptotic generalization error $E_g$ is seen to be larger than that 
corresponding to random guessing (over-training) [13]. When we introduce weight decay 
this phenomenon disappears. An optimal weight decay, minimizing the asymptotic $E_g$, 
exists for $a < a_{c2}$ and is given by $\gamma_{\mathrm{opt}} = 2a^2\rho_a^2/(2\log 2 - a^2)$. 

Figure 1. Left: The asymptotic generalization error as a function of the width of the 
reversed wedge $a$ in the limit of $\alpha\to\infty$, for $\gamma = 0$, $\gamma = 0.5$ and $\gamma = 1$. 
We chose $\eta = 1$. Right: The corresponding normalized overlap $r = R/\sqrt{Q}$ 
which gives the generalization error in the left figure. The best possible values for the 
generalization error and the optimal normalized overlap are shown by thick lines. 
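The quoted value of the optimal weight decay can be checked by combining the $t\to\infty$ limits of (7) and (8) at $\alpha\to\infty$ with the optimal overlap $r^*$ given earlier. The derivation below is a sketch under the assumption that the optimum is reached when the asymptotic normalized overlap, written here as $r_\infty$ (a symbol introduced for this check), equals $r^*$:

```latex
\begin{align*}
R_\infty &= \frac{\eta\rho_a}{\gamma}, \qquad
Q_\infty - R_\infty^2 = \frac{\eta^2}{2\gamma}
\quad\Longrightarrow\quad
r_\infty = \frac{R_\infty}{\sqrt{Q_\infty}} = \frac{\rho_a}{\sqrt{\rho_a^2 + \gamma/2}}, \\
r_\infty^2 &= \frac{2\log 2 - a^2}{2\log 2}
\quad\Longrightarrow\quad
a^2\rho_a^2 = \tfrac{1}{2}\,\gamma\,(2\log 2 - a^2)
\quad\Longrightarrow\quad
\gamma_{\mathrm{opt}} = \frac{2a^2\rho_a^2}{2\log 2 - a^2}.
\end{align*}
```

The signs are consistent: for $a < a_{c2}$ one has $\rho_a < 0$, so $r_\infty$ is negative, as is $r^*$.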

For finite $\alpha$ and short times, $t \ll 1/\gamma$, we can expand (7) and (8) with respect to 
$\gamma t$ and find $R(t) = R_0 + \eta\rho_a t$, $Q(t) - R^2(t) = Q_0 - R_0^2 + \eta^2 t + \eta^2 t^2/\alpha$. In this regime 
the training time is too short for weight decay to have an effect. For $t \gg 1/\gamma$, on the 
other hand, it is clear from (7) and (8) that the order parameters $Q$, $R$ decay to their 
asymptotic values exponentially. For the case of $\gamma\to 0$, the small $\gamma t$ expansions are 
valid for all time. Upon expanding $E_g$ with respect to $\gamma$ we obtain 



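[Display equation: expansion of $E_g$ to first order in $\gamma$. Only its overall structure survives in this copy: two integrals over $Dx$ of $\mathrm{erf}[\cdots]$ with arguments proportional to $\rho_a x$, plus a correction term involving $\alpha\rho_a^2$.] 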
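The short-time forms of $R(t)$ and $Q(t) - R^2(t)$ quoted above can be traced directly to (7) and (8). Subtracting the square of (8) from (7) removes every $\rho_a$-dependent term, after which the exponentials are expanded to lowest order in $\gamma t$. A sketch of the steps:

```latex
\begin{align*}
Q(t) - R^2(t) &= e^{-2\gamma t}\,(Q_0 - R_0^2)
  + \frac{\eta^2}{2\gamma}\,(1 - e^{-2\gamma t})
  + \frac{\eta^2}{\alpha\gamma^2}\,(1 - e^{-\gamma t})^2, \\
1 - e^{-\gamma t} &= \gamma t + O(\gamma^2 t^2), \qquad
1 - e^{-2\gamma t} = 2\gamma t + O(\gamma^2 t^2), \\
R(t) &\simeq R_0 + \eta\rho_a t, \qquad
Q(t) - R^2(t) \simeq Q_0 - R_0^2 + \eta^2 t + \frac{\eta^2 t^2}{\alpha}.
\end{align*}
```

The first line also makes explicit why the component of the weight vector orthogonal to the teacher does not depend on $\rho_a$.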



We next turn to the student field distribution $P_t(x)$. If the number of examples 
in the training set is much larger than the number of training steps (i.e. for $\alpha\to\infty$), 
the student fields $x = \mathbf{J}\cdot\boldsymbol{\xi}$ are described by a Gaussian distribution, due to the central 
limit theorem. For $\alpha < \infty$ however, where the training sets are restricted and questions 
are recycled during the training process, complicated correlations build up and the field 
distribution generally acquires a non-Gaussian shape. In order to determine $P_t(x)$, we 
first calculate the joint distribution for student fields, teacher fields, and outputs (with 
$x, y \in \mathbb{R}$ and $T_a \in \{-1,1\}$): 

$$P_t(x,y,T_a) = \lim_{N\to\infty}\frac{1}{p}\sum_{\mu=1}^{p}\langle\langle \delta(x - \mathbf{J}\cdot\boldsymbol{\xi}^\mu)\,\delta(y - \mathbf{B}\cdot\boldsymbol{\xi}^\mu)\,\delta_{T_a,T_a^\mu}\rangle_\Omega\rangle_{\mathrm{sets}}.$$
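Its characteristic function is 

$$\hat{P}_t(\hat{x},\hat{y},\hat{T}) = \langle e^{-i(\hat{x}x + \hat{y}y + \hat{T}T_a)}\rangle_{P_t(x,y,T_a)}$$

where $\langle f(x,y,T_a)\rangle_{P_t(x,y,T_a)} = \int dx\,dy\sum_{T_a=\pm1}P_t(x,y,T_a)\,f(x,y,T_a)$. Working out the $\langle\cdots\rangle_\Omega$ 
averages we obtain 

$$\hat{P}_t(\hat{x},\hat{y},\hat{T}) = \lim_{N\to\infty}\Big\langle \frac{1}{p}\sum_{\mu=1}^{p}\exp\big[-i\hat{x}\,\sigma^{Nt}\,\mathbf{J}_0\cdot\boldsymbol{\xi}^\mu - i\hat{y}\,\mathbf{B}\cdot\boldsymbol{\xi}^\mu - i\hat{T}\,T_a^\mu\big]\prod_{\ell=0}^{Nt-1}\Big\{\frac{1}{p}\sum_{\nu=1}^{p}\exp\big[-i\hat{x}\,\frac{\eta}{N}\,\sigma^{Nt-\ell-1}\,T_a^\nu\,(\boldsymbol{\xi}^\mu\cdot\boldsymbol{\xi}^\nu)\big]\Big\}\Big\rangle_{\mathrm{sets}}$$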

By using the general relation 

$$\hat{P}_t(\hat{x},\hat{y},\hat{T}) = \int dx\,dy\sum_{T_a=\pm1}e^{-i(\hat{x}x+\hat{y}y+\hat{T}T_a)}\,P_t(x|y,T_a)\,P(y,T_a) \qquad (10)$$

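and some further algebra, following closely the procedure outlined in [11] (to which we 
refer for details), we then obtain the probability density $P_t(x)$ as 

$$P_t(x) = \int Dy\sum_{T_a=\pm1}P_t(x|y,T_a)\,P(T_a|y)$$

$$= \int\frac{d\hat{x}}{2\pi}\,e^{-\frac{1}{2}Q\hat{x}^2+\chi_r(\hat{x})}\cos(\hat{x}x)\,\big[\cos(\chi_i(\hat{x})) + G(\hat{x}R)\sin(\chi_i(\hat{x}))\big]$$

$$- 4e^{-a^2/2}\int\frac{d\hat{x}}{2\pi}\,e^{-\frac{1}{2}Q\hat{x}^2+\chi_r(\hat{x})}\cos(\hat{x}x)\sin(\chi_i(\hat{x}))\int_0^{\hat{x}R}\frac{d\lambda}{\sqrt{2\pi}}\,e^{\lambda^2/2}\cos(a\lambda) \qquad (11)$$

where we defined the functions $\chi_r$, $\chi_i$ and $G$ as 

$$\chi_r(\hat{x}) = \frac{1}{\alpha}\int_0^t ds\,\big\{\cos[\eta\hat{x}e^{-\gamma(t-s)}]-1\big\}, \qquad \chi_i(\hat{x}) = -\frac{1}{\alpha}\int_0^t ds\,\sin[\eta\hat{x}e^{-\gamma(t-s)}],$$

$$G(\Delta) = e^{\Delta^2/2}\int Dz\,\sin(\Delta|z|) = 2\int_0^\Delta\frac{d\lambda}{\sqrt{2\pi}}\,e^{\lambda^2/2}.$$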
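The two expressions given for $G(\Delta)$, and the behaviour of the $a$-dependent integral in (11) in the limits $a = 0$ and $a \to \infty$, can be checked by direct quadrature. The sketch below is an illustration only; the function names, the cutoff on the Gaussian integral and the grid size are choices made here, and `wedge_term` is a hypothetical name for the factor $4e^{-a^2/2}\int_0^{\Delta}\frac{d\lambda}{\sqrt{2\pi}}e^{\lambda^2/2}\cos(a\lambda)$.

```python
import math

SQRT_2PI = math.sqrt(2.0 * math.pi)


def _trapezoid(f, lo: float, hi: float, n: int = 20000) -> float:
    """Plain trapezoidal rule; also gives the signed integral when hi < lo."""
    h = (hi - lo) / n
    total = 0.5 * (f(lo) + f(hi))
    for k in range(1, n):
        total += f(lo + k * h)
    return total * h


def G_gaussian_average(delta: float, cutoff: float = 12.0) -> float:
    """exp(delta^2/2) * Int Dz sin(delta |z|), using the symmetry in z."""
    integrand = lambda z: math.exp(-0.5 * z * z) / SQRT_2PI * math.sin(delta * z)
    return math.exp(0.5 * delta * delta) * 2.0 * _trapezoid(integrand, 0.0, cutoff)


def G_lambda_integral(delta: float) -> float:
    """2 * Int_0^delta dlambda / sqrt(2 pi) * exp(lambda^2 / 2)."""
    integrand = lambda lam: math.exp(0.5 * lam * lam) / SQRT_2PI
    return 2.0 * _trapezoid(integrand, 0.0, delta)


def wedge_term(delta: float, a: float) -> float:
    """4 exp(-a^2/2) * Int_0^delta dlambda / sqrt(2 pi) * exp(lambda^2/2) cos(a lambda)."""
    integrand = lambda lam: math.exp(0.5 * lam * lam) / SQRT_2PI * math.cos(a * lam)
    return 4.0 * math.exp(-0.5 * a * a) * _trapezoid(integrand, 0.0, delta)


for delta in (0.5, 1.0, 2.0):
    print(delta, G_gaussian_average(delta), G_lambda_integral(delta))
```

At $a = 0$ the wedge factor equals $2G(\Delta)$, which turns the bracket $\cos\chi_i + G\sin\chi_i$ into $\cos\chi_i - G\sin\chi_i$, while for large $a$ it vanishes and only the realizable expression remains.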

It follows from (11) that $P_t(x)$ is a symmetric function of $x$, for all times and all 
values of the reversed wedge width $a$. In the special cases $a = \infty$ and $a = 0$ (where the 
task becomes realizable) we find our result (11) reducing to that of [11]: 

$$P_t(x) = \int\frac{d\hat{x}}{2\pi}\,e^{-\frac{1}{2}Q\hat{x}^2+\chi_r(\hat{x})}\cos(\hat{x}x)\,\big\{\cos(\chi_i(\hat{x})) + (1-2\lambda)\,G(\hat{x}R)\sin(\chi_i(\hat{x}))\big\}$$

with $\lambda = 0$ for $a = \infty$, and $\lambda = 1$ for $a = 0$. This is consistent with [11], where 
the parameter $\lambda$ denoted the probability that a teacher output was corrupted by noise. 
Here we find that, if the width of the reversed wedge is zero, the transfer function of 
the teacher is the inverse of that of the student, and the output of the teacher can 
be regarded as a noisy output with flip probability $\lambda = 1$. In contrast, in the general 
case $0 < a < \infty$, equation (11) shows that the effect of structural non-realizability 
cannot be described by an 'effective' output noise. In Figure 2 we plot $P_t(x)$ as given by 
equation (11), for $a = 1$, $\eta = 1$ and $\gamma = 0.5$, at different points in time, and we compare 
the result to the corresponding observations in numerical simulations (histograms). One 
clearly observes how $P_t(x)$ evolves from a Gaussian distribution at $t = 0$ to a manifestly 
non-Gaussian one. 




Figure 2. The student field distribution $P_t(x)$ generated during on-line Hebbian 
learning, from a teacher with a reversed wedge of width $a = 1$ and for $\eta = 1$, 
$\alpha = 0.5$ and $\gamma = 0.5$, at times $t \in \{1,2,3,4\}$. Solid lines: the theoretical result (11). 
Histograms: results obtained via computer simulations for systems of size $N = 10000$. 



Finally, we calculate the training error $E_t$, which measures the average fraction of 
errors made by the student on inputs taken from the training set. It is given by 

$$E_t = \int\!\!\int dx\,Dy\sum_{T_a=\pm1}\theta[-T_a(y)S(x)]\,P_t(x|y,T_a)\,P(T_a|y).$$

By using (10) and (11) we can obtain the explicit form of $E_t$ as 

$$E_t = \frac{1}{2} - \int\frac{d\hat{x}}{2\pi\hat{x}}\,e^{-\frac{1}{2}Q\hat{x}^2+\chi_r(\hat{x})}\,\big\{G(\hat{x}R)\cos(\chi_i(\hat{x})) - \sin(\chi_i(\hat{x}))\big\}$$

$$+ 4e^{-a^2/2}\int\frac{d\hat{x}}{2\pi\hat{x}}\,e^{-\frac{1}{2}Q\hat{x}^2+\chi_r(\hat{x})}\cos(\chi_i(\hat{x}))\int_0^{\hat{x}R}\frac{d\lambda}{\sqrt{2\pi}}\,e^{\lambda^2/2}\cos(a\lambda). \qquad (12)$$

In Figure 3, we plot both the training error (12) and the generalization error (1) for four 
different values of the width $a$ of the teacher's reversed wedge, viz. $a \in \{0.0, 0.5, 1.0, 1.5\}$. 
In all cases we find the theoretical results and the computer simulations to be in excellent 
agreement. In the limit $t\to\infty$ we also observe that the asymptotic values of both $E_g$ 
and $E_t$ indeed approach $E_g|_{\alpha=\infty}$ (see (9)) for increasing $\alpha$, as they should. 
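Both error measures are simple to estimate in a direct simulation of the learning rule, which is how the comparison with theory can be reproduced in spirit. The sketch below is not the authors' simulation code: the system size, the random initial weight vector of unit length, the number of test inputs and all names are assumptions made for illustration.

```python
import numpy as np


def reversed_wedge(y: np.ndarray, a: float) -> np.ndarray:
    """Vectorised teacher output sgn[y (a - y)(a + y)], with sgn[0] taken as +1."""
    return np.where(y * (a - y) * (a + y) >= 0, 1.0, -1.0)


def simulate_errors(N, alpha, a, eta, gamma, t, n_test=2000, seed=0):
    """Run on-line Hebbian learning for t*N steps; return (E_t, E_g) estimates."""
    rng = np.random.default_rng(seed)
    p = max(1, int(round(alpha * N)))

    # Teacher weights, normalised so that B.B = 1.
    B = rng.standard_normal(N)
    B /= np.linalg.norm(B)

    # Restricted training set with binary inputs and teacher labels.
    patterns = rng.choice([-1.0, 1.0], size=(p, N))
    labels = reversed_wedge(patterns @ B, a)

    # Assumed initial condition: random direction with Q_0 = 1.
    J = rng.standard_normal(N)
    J /= np.linalg.norm(J)

    sigma = 1.0 - gamma / N
    for mu in rng.integers(0, p, size=int(round(t * N))):
        J = sigma * J + (eta / N) * labels[mu] * patterns[mu]

    # Training error: fraction of training examples the student gets wrong.
    student_train = np.where(patterns @ J >= 0, 1.0, -1.0)
    E_t = float(np.mean(student_train != labels))

    # Generalization error: estimated on fresh random inputs.
    fresh = rng.choice([-1.0, 1.0], size=(n_test, N))
    student_fresh = np.where(fresh @ J >= 0, 1.0, -1.0)
    E_g = float(np.mean(student_fresh != reversed_wedge(fresh @ B, a)))
    return E_t, E_g


E_t, E_g = simulate_errors(N=200, alpha=0.5, a=10.0, eta=1.0, gamma=0.5, t=4.0, seed=1)
print(E_t, E_g)
```

With a small training set the student revisits each example many times, so the measured training error lies well below the generalization error; the gap is expected to narrow as $\alpha$ grows.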

In conclusion, in this letter we have solved the dynamics of on-line Hebbian learning 
with structurally unrealizable restricted training sets exactly, for the case where a standard 
perceptron is being trained by a teacher perceptron with a reversed wedge transfer 
function. Although our solution applies only to Hebbian learning (as did the one in 
[11]), we believe that our results provide a valuable new benchmark against which to 
test (approximations made in) more general formalisms, such as generating functional 
analysis 0, £|, [L(|, dynamical replica theory |5|, [(J or the cavity method 0. 



The authors would like to thank King's College London (JI) and the Tokyo Institute of 
Technology (ACCC) for their hospitality. 
Figure 3. Training errors $E_t$ and generalization errors $E_g$ as functions of time, for 
different values of $a$. In all cases $\eta = 1$ and $\gamma = 0.5$, with initial conditions $Q_0 = 1$ and 
$E_g(t = 0) = 0.5$. From the upper left panel to the lower right panel: $a = 0.0$, $0.5$, $1.0$ 
and $1.5$. In each panel, the upper three solid lines indicate our theoretical results 
for $E_g$, together with the corresponding results of computer simulations: □ ($\alpha = 0.5$), 
■ ($\alpha = 1.0$) and • ($\alpha = 4.0$). The lower three lines are theoretical results for $E_t$, 
compared to the results of computer simulations, with □ ($\alpha = 0.5$), ■ ($\alpha = 1.0$) and 
• ($\alpha = 4.0$). All simulations are carried out for systems of size $N = 5000$. 